Real-world robotic grasping can be done robustly if a complete 3D Point Cloud Data (PCD) of an object is available. However, in practice, PCDs are often incomplete when objects are viewed from few and sparse viewpoints before the grasping action, leading to the generation of wrong or inaccurate grasp poses. We propose a novel grasping strategy, named 3DSGrasp, that predicts the missing geometry from the partial PCD to produce reliable grasp poses. Our proposed PCD completion network is a Transformer-based encoder-decoder network with an Offset-Attention layer. Our network is inherently invariant to the object pose and point's permutation, which generates PCDs that are geometrically consistent and completed properly. Experiments on a wide range of partial PCD show that 3DSGrasp outperforms the best state-of-the-art method on PCD completion tasks and largely improves the grasping success rate in real-world scenarios. The code and dataset will be made available upon acceptance.
translated by 谷歌翻译
给定的用户输入的自动生成平面图在建筑设计中具有很大的潜力,最近在计算机视觉社区中探索了。但是,大多数现有方法以栅格化图像格式合成平面图,这些图像很难编辑或自定义。在本文中,我们旨在将平面图合成为1-D向量的序列,从而简化用户的互动和设计自定义。为了产生高保真矢量化的平面图,我们提出了一个新颖的两阶段框架,包括草稿阶段和多轮精炼阶段。在第一阶段,我们使用图形卷积网络(GCN)编码用户的房间连接图输入,然后应用自回归变压器网络以生成初始平面图。为了抛光最初的设计并生成更具视觉吸引力的平面图,我们进一步提出了一个由GCN和变压器网络组成的新颖的全景精炼网络(PRN)。 PRN将初始生成的序列作为输入,并完善了平面图设计,同时鼓励我们提出的几何损失来鼓励正确的房间连接。我们已经对现实世界平面图数据集进行了广泛的实验,结果表明,我们的方法在不同的设置和评估指标下实现了最先进的性能。
translated by 谷歌翻译
We present Muse, a text-to-image Transformer model that achieves state-of-the-art image generation performance while being significantly more efficient than diffusion or autoregressive models. Muse is trained on a masked modeling task in discrete token space: given the text embedding extracted from a pre-trained large language model (LLM), Muse is trained to predict randomly masked image tokens. Compared to pixel-space diffusion models, such as Imagen and DALL-E 2, Muse is significantly more efficient due to the use of discrete tokens and requiring fewer sampling iterations; compared to autoregressive models, such as Parti, Muse is more efficient due to the use of parallel decoding. The use of a pre-trained LLM enables fine-grained language understanding, translating to high-fidelity image generation and the understanding of visual concepts such as objects, their spatial relationships, pose, cardinality etc. Our 900M parameter model achieves a new SOTA on CC3M, with an FID score of 6.06. The Muse 3B parameter model achieves an FID of 7.88 on zero-shot COCO evaluation, along with a CLIP score of 0.32. Muse also directly enables a number of image editing applications without the need to fine-tune or invert the model: inpainting, outpainting, and mask-free editing. More results are available at https://muse-model.github.io
translated by 谷歌翻译
The visual dimension of cities has been a fundamental subject in urban studies, since the pioneering work of scholars such as Sitte, Lynch, Arnheim, and Jacobs. Several decades later, big data and artificial intelligence (AI) are revolutionizing how people move, sense, and interact with cities. This paper reviews the literature on the appearance and function of cities to illustrate how visual information has been used to understand them. A conceptual framework, Urban Visual Intelligence, is introduced to systematically elaborate on how new image data sources and AI techniques are reshaping the way researchers perceive and measure cities, enabling the study of the physical environment and its interactions with socioeconomic environments at various scales. The paper argues that these new approaches enable researchers to revisit the classic urban theories and themes, and potentially help cities create environments that are more in line with human behaviors and aspirations in the digital age.
translated by 谷歌翻译
Ithaca is a Fuzzy Logic (FL) plugin for developing artificial intelligence systems within the Unity game engine. Its goal is to provide an intuitive and natural way to build advanced artificial intelligence systems, making the implementation of such a system faster and more affordable. The software is made up by a C\# framework and an Application Programming Interface (API) for writing inference systems, as well as a set of tools for graphic development and debugging. Additionally, a Fuzzy Control Language (FCL) parser is provided in order to import systems previously defined using this standard.
translated by 谷歌翻译
Grasping is an incredible ability of animals using their arms and limbs in their daily life. The human hand is an especially astonishing multi-fingered tool for precise grasping, which helped humans to develop the modern world. The implementation of the human grasp to virtual reality and telerobotics is always interesting and challenging at the same time. In this work, authors surveyed, studied, and analyzed the human hand-grasping behavior for the possibilities of haptic grasping in the virtual and remote environment. This work is focused on the motion and force analysis of fingers in human hand grasping scenarios and the paper describes the transition of the human hand grasping towards a tripod haptic grasp model for effective interaction in virtual reality.
translated by 谷歌翻译
This short report reviews the current state of the research and methodology on theoretical and practical aspects of Artificial Neural Networks (ANN). It was prepared to gather state-of-the-art knowledge needed to construct complex, hypercomplex and fuzzy neural networks. The report reflects the individual interests of the authors and, by now means, cannot be treated as a comprehensive review of the ANN discipline. Considering the fast development of this field, it is currently impossible to do a detailed review of a considerable number of pages. The report is an outcome of the Project 'The Strategic Research Partnership for the mathematical aspects of complex, hypercomplex and fuzzy neural networks' meeting at the University of Warmia and Mazury in Olsztyn, Poland, organized in September 2022.
translated by 谷歌翻译
Nonconvex-nonconcave minimax optimization has been the focus of intense research over the last decade due to its broad applications in machine learning and operation research. Unfortunately, most existing algorithms cannot be guaranteed to converge and always suffer from limit cycles. Their global convergence relies on certain conditions that are difficult to check, including but not limited to the global Polyak-\L{}ojasiewicz condition, the existence of a solution satisfying the weak Minty variational inequality and $\alpha$-interaction dominant condition. In this paper, we develop the first provably convergent algorithm called doubly smoothed gradient descent ascent method, which gets rid of the limit cycle without requiring any additional conditions. We further show that the algorithm has an iteration complexity of $\mathcal{O}(\epsilon^{-4})$ for finding a game stationary point, which matches the best iteration complexity of single-loop algorithms under nonconcave-concave settings. The algorithm presented here opens up a new path for designing provable algorithms for nonconvex-nonconcave minimax optimization problems.
translated by 谷歌翻译
In this paper, we present an evolved version of the Situational Graphs, which jointly models in a single optimizable factor graph, a SLAM graph, as a set of robot keyframes, containing its associated measurements and robot poses, and a 3D scene graph, as a high-level representation of the environment that encodes its different geometric elements with semantic attributes and the relational information between those elements. Our proposed S-Graphs+ is a novel four-layered factor graph that includes: (1) a keyframes layer with robot pose estimates, (2) a walls layer representing wall surfaces, (3) a rooms layer encompassing sets of wall planes, and (4) a floors layer gathering the rooms within a given floor level. The above graph is optimized in real-time to obtain a robust and accurate estimate of the robot's pose and its map, simultaneously constructing and leveraging the high-level information of the environment. To extract such high-level information, we present novel room and floor segmentation algorithms utilizing the mapped wall planes and free-space clusters. We tested S-Graphs+ on multiple datasets including, simulations of distinct indoor environments, on real datasets captured over several construction sites and office environments, and on a real public dataset of indoor office environments. S-Graphs+ outperforms relevant baselines in the majority of the datasets while extending the robot situational awareness by a four-layered scene model. Moreover, we make the algorithm available as a docker file.
translated by 谷歌翻译
The field of Automatic Music Generation has seen significant progress thanks to the advent of Deep Learning. However, most of these results have been produced by unconditional models, which lack the ability to interact with their users, not allowing them to guide the generative process in meaningful and practical ways. Moreover, synthesizing music that remains coherent across longer timescales while still capturing the local aspects that make it sound ``realistic'' or ``human-like'' is still challenging. This is due to the large computational requirements needed to work with long sequences of data, and also to limitations imposed by the training schemes that are often employed. In this paper, we propose a generative model of symbolic music conditioned by data retrieved from human sentiment. The model is a Transformer-GAN trained with labels that correspond to different configurations of the valence and arousal dimensions that quantitatively represent human affective states. We try to tackle both of the problems above by employing an efficient linear version of Attention and using a Discriminator both as a tool to improve the overall quality of the generated music and its ability to follow the conditioning signals.
translated by 谷歌翻译